1 Introduction

In this session we will cover a few topics on data visualization.

1. We are going to recreate a protein-protein interaction network plot I’ve made for a paper published recently. For that, we are going to use Cytoscape.

2. We will explore different types of variables and plot them using the R language environment. More specifically we will use RStudio Cloud and a package called ggplot2.

2 Network visualization using Cytoscape

2.1 Context

In deLomana et al. [2020], we studied a few aspects of Halobacterium salinarum translational regulation. One of the questions we needed to answer to support one of our hypotheses was:

  • Is transcription and translation coupled in H. salinarum?

We know that transcription and translation happen in different compartments in the eukaryotic cell: Transcription happens inside the nuclei. Translation happens in the cytoplasm.

In the prokaryotic cell, we don’t have a membrane to separate the genetic material from the cytoplasm, so transcription and translation are likely to happen simultaneously. Coupled transcription and translation is a fact in Bacteria. But what about Archaea?

Source: https://www.mun.ca/biology/scarr/iGen3_05-09.html

In the Baliga Lab, a few years ago, Marc Facciotti and his colleagues performed a target coimmunoprecipitation experiment to find out the proteins coupled to the general transcription factors of H. salinarum, that is proteins of the transcription machinery. [Facciotti et al. 2007].

Source: https://www.creativebiomart.net/resource/fast-chromatin-immunoprecipitation-assay-392.htm

Using those results, we checked if proteins of the translational machinery were present in the pulldown fractions when transcription proteins were used as baits. Indeed, we were able to find a few, supporting our hypothesis of coupled transcription and translation:

RPs physically interact with transcription complex components. Diamonds represent RPs; squares represent transcription complex components. Tagged proteins used as bait in the immunoprecipitation experiment are highlighted by a black border. Arrowheads link bait to coimmunoprecipitated proteins. We labeled each of the seven modules obtained by the Newman-Girvan clustering algorithm using a different color [de Lomana et al., 2020].

2.2 Data structure

2.2.1 Protein-protein interaction network

This is the structure of a Simple Interaction File (SIF)

2.2.2 Protein information

2.3 Installing Cytoscape and apps (or plugins)

1. Go to the Cytoscape download page. Download and install it.

2. Open Cytoscape program and install the following apps:

  • clusterMaker2
    Apps -> App Manager -> Search: clusterMaker2 -> Select listing -> Click on Install button
  • Color Cast
    Apps -> App Manager -> Search: Color Cast -> Select listing -> Click on Install button
  • yFiles Layout Algorithms
    Apps -> App Manager -> Search: yFiles -> Select listing -> Click on Install button

2.4 Plotting a protein-protein interaction network

1. Import the protein-protein interaction file ppi.sif.
File -> Import -> Network from URL ->
https://alanlorenzetti.github.io/dataVisSession2020/data/ppi.sif

2. Import the protein information table.
File -> Import -> Table from URL -> https://alanlorenzetti.github.io/dataVisSession2020/data/ppiFunCat.tsv ->
Where to Import Table Data: To selected networks only -> Click on OK button

3. Click on Style tab. Let’s change the design of our network.

  • Select the Sample1 preset style.
  • Change the label of nodes by selecting the label column.
  • Change the shape of nodes according to the functional class of proteins. Rectangles will represent Transcription class. Diamonds will represent Translation class.
  • Add a thick border to the nodes of proteins used as baits. Set size 2.
  • Make the nodes bigger. Select size 35.
  • Click on the Edge tab. Make the edges thicker (size 1.5) and black.
  • Add arrowheads (Target Arrow Shape; Delta) to the end of edges.
  • Remove edge labels.

4. Apply the Newman-Girvan modularity algorithm to find modules of highly interconnected proteins. Use the default parameters. Apps -> clusterMaker -> Community Cluster (GLay)

5. Change the color of nodes according to the modules. We will use a plugin called Color Cast to make our lives easier. Select __glayCluster as the target data column. We’ll apply the Set2 colors. Tools -> Color Cast -> Color Cast

6. Remove all the nodes not classified as Transcription or Translation.

  • Using the Node Table panel, order the class column and select all proteins of those classes.
  • Then right-click the selection and then click on Select nodes from selected rows.
  • Hide the unselected nodes. Select -> Nodes -> Hide Unselected Nodes.

7. Apply an automatic layout.

  • Select all proteins used as baits (those with borders).
  • Apply yFiles Hierarchic Layout Selected Nodes. Layouts -> Hierarchic Layout Selected Nodes.

8. Save your network and export as an image file.

  • File -> Save
  • File -> Export -> Network to Image

3 Data visualization in R using the ggplot2 package

3.1 Loading RStudio Cloud environment

1. Go to RStudio Cloud website

2. Log in with your Google account

3. Create a new project. Suggestion: DataVis

3.2 Installing ggplot2 package

# installing ggplot2 package
install.packages("ggplot2")

3.3 Loading libs and the table we will work with

# loading ggplot2
library(ggplot2)
theme_set(theme_bw())

# loading our dataframe
haloExp = read.delim("https://alanlorenzetti.github.io/dataVisSession2020/data/haloExpression.tsv")

3.4 Variable types

3.4.1 Discrete variables

  • Counts:

    • The number of times you cursed this damn virus this week (5)
    • Number of stars in the Milky Way (250 billion or 2.5 x 10^11)
    • Number of genes in a compact genome (2600)
    • Number of messenger RNA molecules in a bacterial cell (~1800) [Moran et al. 2013]
    • Length of a messenger RNA (1573 nucleotides)

3.4.2 Continuous variables

  • Measures:

    • Height of an individual (162.56 centimeters)
    • Weight of an individual (89.77 kilograms)
    • mRNA expression (1401.75 arbitrary expression unit)
    • Protein expression (1500.62 arbitrary expression unit)
  • Ratios:

    • GC content, i.e., frequency of Guanine or Cytosine bases in a transcript (e.g. 973/1573 = 0.619)

3.4.3 Categorical variables

  • Characteristics:
    • Color of eyes (black, brown, blue, green, red)
    • Biological process of a protein (Transcription, Cell Motility, Unknown function, etc).
    • Transcript length (short, mid, long)

3.5 Data structure

3.6 Syntax

ggplot(data = <YOUR_DATASET_HERE>, mapping = aes(x = <X_AXIS_VARIABLE>, y = <Y_AXIS_VARIABLE>))
+
geom_<TYPE_OF_CHART>
+
additional modifications

3.7 Visualizations

What kind of plot is suitable to my data? We have to think about the i) number of variables, ii) the type of variables, and iii) the goal of the visualization.

3.7.1 One variable (continuous or discrete)

3.7.1.1 Histogram

# plotting a histogram of transcript length
ggplot(data = haloExp, mapping = aes(x = length)) +
  geom_histogram()

3.7.1.2 Density curves

# plotting the density curve of transcript length
ggplot(data = haloExp, mapping = aes(x = length)) +
  geom_density()

# plotting density curves for each one of the transcript length categories
ggplot(data = haloExp, mapping = aes(x = length, color = length_category)) +
  geom_density()

# plotting density curves for each one of the transcript length categories
# one panel per length category
ggplot(data = haloExp, mapping = aes(x = length, color = length_category)) +
  geom_density(show.legend = FALSE) +
  facet_wrap(~ length_category)

3.7.2 One variable (categorical)

3.7.2.1 Bar plots

# plotting a bar chart for the protein biological classes
ggplot(data = haloExp, mapping = aes(x = biological_class)) +
  geom_bar() +
  coord_flip()

# plotting a bar chart for the protein biological classes
# colors were added to distinguish each one of the classes
ggplot(data = haloExp, mapping = aes(x = biological_class, fill = biological_class)) +
  geom_bar(show.legend = FALSE) +
  coord_flip()

3.7.3 Two variables (x = continuous or discrete; y = continuous or discrete)

3.7.3.1 Scatter plots

# plotting a scatter plot of protein expression in function of mRNA expression
ggplot(data = haloExp, mapping = aes(x = mRNA_expression, y = protein_expression)) +
  geom_point()

# plotting a scatter plot of protein expression in function of mRNA expression
# a trend line was added to show the linear correspondence of both
ggplot(data = haloExp, mapping = aes(x = mRNA_expression, y = protein_expression)) +
  geom_point() +
  geom_smooth(method = "lm")

# plotting a scatter plot of protein expression in function of mRNA expression
# a trend line was added to show the linear correspondence of both
# plots were separated in panels according to the transcript length category
ggplot(data = haloExp, mapping = aes(x = mRNA_expression, y = protein_expression)) +
  geom_point() +
  geom_smooth(method = "lm") +
  facet_grid(~ length_category)

# plotting a scatter plot of protein expression in function of mRNA expression
# colors were added according to the GC content of the transcripts
ggplot(data = haloExp, mapping = aes(x = mRNA_expression, y = protein_expression, color = GC)) +
  geom_point()

3.7.4 Two variables (x = categorical; y = continuous or discrete)

3.7.4.1 Boxplot

What is a box plot?

# plotting a boxplot of GC content for the transcripts
# separated by biological classes
ggplot(data = haloExp, mapping = aes(y = GC, x = biological_class)) +
  geom_boxplot() +
  coord_flip()

3.7.5 Two variables (x = categorical; y = categorical)

3.7.5.1 Count plot

# plotting a count plot of length category and biological class
ggplot(data = haloExp, mapping = aes(y = length_category, x = biological_class)) +
  geom_count() +
  coord_flip()

3.7.6 Several variables of different types

3.7.6.1 3D scatter plots (x, y, z = continuous or discrete)

3.7.6.2 Heat maps (continuous, discrete, and categorical variables altogether)

3.8 Exercise

  • What other combinations of variables we can explore using the given data set?

4 Resources

5 References

DE LOMANA, A. et al. Selective Translation of Low Abundance and Upregulated Transcripts in Halobacterium salinarum. mSystems, v. 5, n. 4, 28 jul. 2020.

FACCIOTTI, M. T. et al. General transcription factor specified global gene regulation in archaea. Proceedings of the National Academy of Sciences of the United States of America, v. 104, n. 11, p. 4630–4635, 13 mar. 2007.

MORAN, M. A. et al. Sizing up metatranscriptomics. The ISME Journal, v. 7, n. 2, p. 237–243, fev. 2013.